[57] ABSTRACT

A method and apparatus are provided for detecting and correcting various data errors that may arise in a mass data storage apparatus comprising a set of physical mass storage devices operating as one or more larger logical mass storage devices. More particularly, there is provided a method and apparatus for determining, on restoration of power to a device set, whether or not a write operation was interrupted when power was removed, and for reconstructing any data that may be inconsistent because of the removal of power.

2 Claims, 6 Drawing Sheets 
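
The abstract's two steps (detecting an interrupted write on restoration of power, then restoring consistency) can be illustrated with a minimal sketch. It is not the patented apparatus. It assumes the variant in which a record of the operation is placed in nonvolatile memory before the write starts and erased on completion, it substitutes simple XOR parity for the check code, and it only recomputes the check block rather than regenerating data blocks. The names `DeviceSet`, `nvram`, `pending_stripe`, and `fail_after` are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (assumed stand-in for the check code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class DeviceSet:
    """Toy device set: several data devices plus one check device."""

    def __init__(self, n_data, n_stripes, block_size=4):
        zero = bytes(block_size)
        self.data = [[zero] * n_stripes for _ in range(n_data)]
        self.check = [zero] * n_stripes
        self.nvram = {}  # hypothetical nonvolatile memory; survives "power loss"

    def write_stripe(self, stripe, new_blocks, fail_after=None):
        """Write one stripe; fail_after simulates power loss after that many device writes."""
        self.nvram["pending_stripe"] = stripe      # record the operation first
        for i, block in enumerate(new_blocks):
            if fail_after is not None and i >= fail_after:
                return False                       # power removed mid-operation
            self.data[i][stripe] = block
        self.check[stripe] = xor_parity(new_blocks)
        del self.nvram["pending_stripe"]           # erase the record on completion
        return True

    def is_consistent(self, stripe):
        """True when the stored check block matches the stored data blocks."""
        return xor_parity([d[stripe] for d in self.data]) == self.check[stripe]

    def on_power_restored(self):
        """Return the interrupted stripe (or None) after making it consistent again."""
        stripe = self.nvram.pop("pending_stripe", None)
        if stripe is not None:
            self.check[stripe] = xor_parity([d[stripe] for d in self.data])
        return stripe


dev = DeviceSet(n_data=3, n_stripes=2)
dev.write_stripe(0, [b"AAAA", b"BBBB", b"CCCC"])
dev.write_stripe(0, [b"aaaa", b"bbbb", b"cccc"], fail_after=1)  # power lost after one device
print(dev.is_consistent(0))      # False
print(dev.on_power_restored())   # 0
print(dev.is_consistent(0))      # True
```

In this sketch a record left behind in `nvram` is the only evidence of the interrupted write: the devices that never started writing hold old but individually valid blocks, so the inconsistency could not be found by inspecting any single device.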
NON-VOLATILE MEMORY STORAGE OF WRITE OPERATION IDENTIFIER IN DATA STORAGE DEVICE

BACKGROUND OF THE INVENTION

The present invention relates to sets of physical mass storage devices that collectively perform as one or more logical mass storage devices. In particular, the present invention relates to methods and apparatus for maintaining data integrity across such a set of physical mass storage devices in the event of a power failure.

Use of disk memory continues to be important in computers because it is nonvolatile and because memory size demands continue to outpace practical amounts of main memory. At this time, disks are slower than main memory so that system performance is often limited by disk access speed. Therefore, it is important for overall system performance to improve both memory size and data access speed of disk drive units. For a discussion of this, see Michelle Y. Kim, "Synchronized Disk Interleaving", IEEE Transactions On Computers, Vol. C-35, No. 11, November 1986.

Disk memory size can be increased by increasing the number of disks and/or increasing the diameters of the disks, but this does not increase data access speed. Memory size and data transfer rate can both be increased by increasing the density of data storage. However, technological constraints limit data density and high density disks are more prone to errors.

A variety of techniques have been utilized to improve data access speed. Disk cache memory capable of holding an entire track of data has been used to eliminate seek and rotation delays for successive accesses to data on a single track. Multiple read/write heads have been used to interleave blocks of data on a set of disks or on a set of tracks on a single disk. Common data block sizes are byte size, word size, and sector size. Disk interleaving is a known supercomputer technique for increasing performance, and is discussed, for example, in the above-noted article.

Data access performance can be measured by a number of parameters, depending on the relevant application. In transaction processing (such as in banking) data transfers are typically small and request rates are high and random. In supercomputer applications, on the other hand, transfers of large data blocks are common.

A recently developed disk memory structure with improved performance at relatively low cost is the Redundant Array of Inexpensive Disks (RAID) (see, for example, David A. Patterson, et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Report No. UCB/CSD 87/391, December, 1987, Computer Science Division (EECS), University of California, Berkeley, California 94720). As discussed in the Patterson et al. reference, the large personal computer market has supported the development of inexpensive disk drives having a better ratio of performance to cost than Single Large Expensive Disk (SLED) systems such as the IBM 3380. The number of I/Os per second per read/write head in an inexpensive disk is within a factor of two of the large disks. Therefore, the parallel transfer from several inexpensive disks in a RAID architecture, in which a set of inexpensive disks function as a single logical disk drive, produces greater performance than a SLED at a reduced price.

Unfortunately, when data is stored on more than one disk, the mean time to failure varies inversely with the number of disks in the array. To correct for this decreased mean time to failure of the system, error recognition and correction is built into the RAID systems. The Patterson et al. reference discusses 5 RAID embodiments each having a different means for error recognition and correction. These RAID embodiments are referred to as RAID levels 1-5.

RAID level 1 utilizes complete duplication of data and so has a relatively small performance per disk ratio. RAID level 2 improves this performance as well as the capacity per disk ratio by utilizing error correction codes that enable a reduction of the number of extra disks needed to provide error correction and disk failure recovery. In RAID level 2, data is interleaved onto a group of G data disks and error codes are generated and stored onto an additional set of C disks referred to as "check disks" to detect and correct a single error. This error code detects and enables correction of random single bit errors in data and also enables recovery of data if one of the G data disks crashes. Since only G of the C+G disks carries user data, the performance per disk is proportional to G/(G+C). G/C is typically significantly greater than 1, so RAID level 2 exhibits an improvement in performance per disk over RAID level 1. One or more spare disks can be included in the system so that if one of the disk drives fails, the spare disk can be electronically switched into the RAID to replace the failed disk drive.

RAID level 3 is a variant of RAID level 2 in which the error detecting capabilities that are provided by most existing inexpensive disk drives are utilized to enable the number of check disks to be reduced to one, thereby increasing the relative performance per disk over that of RAID level 2.

The performance criteria for small data transfers, such as is common in transaction processing, is known to be poor for RAID levels 1-3 because data is interleaved among the disks in bit-sized blocks, such that even for a data access of less than one sector of data, all disks must be accessed. To improve this performance parameter, in RAID level 4, a variant of RAID level 3, data is interleaved onto the disks in sector interleave mode instead of in bit interleave mode as in levels 1-3. The benefit of this is that, for small data accesses (i.e., accesses smaller than G+C sectors of data), all disks need not be accessed. That is, for a data access size between k and k+1 sectors of data, only k+1 data disks need be accessed. This reduces the amount of competition among separate data access requests to access the same data disk at the same time.

Yet the performance of RAID level 4 remains limited because of access contention for the check disk during write operations. For all write operations, the check disk must be accessed in order to store updated parity data on the check disk for each stripe (i.e., row of sectors) of data into which data is written. Therefore, write operations interfere with each other, even for small data accesses. RAID level 5, a variant of RAID level 4, avoids this contention problem on write operations by distributing the parity check data and user data across all disks.

Power failures present unique problems to RAID architectures that conventional error recognition and correction techniques will not handle reliably. In a conventional SLED storage system, write requests translate into write operations on a single disk. If a power failure occurs during such a write request, it is
more likely that the operation will complete before the disk loses power. However, in RAID architectures, write requests translate into write operations on multiple disks. For RAID write operations, if a power failure occurs during a write request, it may happen that only some of the disks involved in the write request will complete their write operations and that others will not have started their write operations. Although this results in only a partial completion of the write request, this failure to complete the write operation will not be detected if the disks that did not complete the write operation did not write any data at all. In addition, the data on the check disks across the data stripe that was being altered may be invalid (i.e., not equal to the correct check information for the data disks), thereby leading to the possibility that other sectors uninvolved in the write request will also not be able to be regenerated subsequently.

In view of the foregoing, it would be desirable to be able to provide a way to determine, on restoration of power to a multiple device mass storage system after a power loss, whether or not a write operation was interrupted when power was removed, and to reconstruct any data that may be inconsistent with other stored data because of the removal of power.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a way to determine on restoration of power to a multiple device mass storage system whether or not a write operation was interrupted when power was removed.

It is another object of the present invention to provide a way to reconstruct any data that may be inconsistent with other stored data because of the removal of power during a write operation.

In accordance with the present invention, there is provided a method and apparatus for determining, on restoration of power to a set of physical mass multiple storage devices, whether or not a write operation was interrupted when power was removed, and for reconstructing any data that may be inconsistent because of the removal of power. The apparatus includes at least one check device for storing redundancy data, nonvolatile memory means, means for detecting a failure of power to the set of physical storage devices, means for, on detection of a power failure during a storage operation, storing in the nonvolatile memory means information regarding said storage operation, and means for, on restoration of power to the set of physical storage devices, reading the information regarding the storage operation from the nonvolatile memory means and reconstructing data on the set of physical storage devices.

In an alternative embodiment, information regarding a storage operation may be stored in the non-volatile memory at the beginning of every storage operation, and is erased therefrom when the storage operation is completed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a prior art channel architecture for routing data to various peripheral devices;

FIG. 2 illustrates a prior art bus architecture for routing data to various peripheral devices;

FIG. 3 illustrates coupling between RAID memories and device controllers in a multiple device mass storage system of a type suitable for use with the present invention;

FIG. 4 illustrates RAID memory 304 of FIG. 3 in greater detail and illustrates hardware used to detect and correct data errors arising from power failure in accordance with the principles of the present invention;

FIG. 5 illustrates a data block layout that includes a time stamp data field for detecting and correcting data errors arising from power failure in accordance with the principles of the present invention;

FIG. 6 illustrates an array of mass storage devices in accordance with the principles of the present invention including four data storage devices divided into two data groups, and a check device; and

FIG. 7 illustrates an example of the operation of a data group including two mass storage devices.

DETAILED DESCRIPTION OF THE INVENTION

1. Description of Exemplary Multiple Device Mass Storage System

To illustrate the principles of the present invention, a description is provided below of a multiple storage device mass storage system in which the present invention is embodied. It is shown that the described multiple device mass storage system can be connected in various computer systems having conventional architectures. It is also shown that the described multiple device mass storage system can include, in addition to the present invention, other novel means for maintaining data integrity in the system. These other novel means are the subject of co-pending, commonly assigned U.S. patent application Ser. No. 07/488,750, entitled "DATA CORRECTIONS APPLICABLE TO REDUNDANT ARRAYS OF INDEPENDENT DISKS", filed concurrently herewith in the names of David T. Powers, Joseph S. Glider and Thomas E. Idleman. Although the present invention is described in the context of a multiple device mass storage system having a RAID architecture, it will be appreciated by one of skill in the art that the present invention is useful in any multiple device storage system architecture in which data is interleaved across more than one physical storage device.

In FIG. 1 is illustrated the general structure of a conventional channel architecture for routing data from main memory in a central processing unit (CPU) to any of a set of data storage devices 114-126. Data emerges from the CPU main memory 101 along any one of a set of channels 102-104 and is selectively directed to one of a set of device controllers 105-113. The selected one of these device controllers then passes this data on to a selected one of the data storage devices attached to that controller. These data storage devices can be of a variety of types, including tape storage, single disk storage and RAID memory storage. Such storage devices may be coupled to more than one controller to provide multiple data paths between the CPU main memory 101 and the storage device. This is shown, for example, by the coupling of storage device 122 to controllers [illegible] and [illegible].

FIG. 2 illustrates an alternative conventional architecture in which channels 102-104 are replaced by an input/output (I/O) bus 201. The data storage devices used in such a system also can be of a variety of types, including tape storage, single disk storage and RAID memory storage. In both of these architectures, during any data access, several switches have to be set to connect CPU main memory 101 to the storage device se-
lected for access. When the storage device is a RAID memory, additional controls must be set to route the data within the RAID memory.

To explain, FIG. 3 shows in greater detail how a pair 301 and 302 of device controllers can be connected to a pair of RAID memories 304 and 305. Each device controller is connected by a bus or channel 319 to a CPU main memory. In general, each RAID memory is attached to at least two device controllers so that there are at least two parallel paths from one or more CPU main memories 101 to that RAID memory. Thus, for example, each of RAID memories 304 and 305 is connected to device controllers 301 and 302 by busses 311 and 312, respectively. As shown, bus 311 may also connect device controller 301 to additional RAID memories. Such parallel data paths from the CPU to the RAID memory are useful for routing data around a busy or failed device controller.

Within each RAID memory are a set 306 of disk drive units 307. This set includes an active set 308 of disk drive units 307 and a backup set 309 of disk drive units 307. In each of RAID memories 304 and 305 is a RAID controller 310 that routes data between device controllers 301 and 302 and the appropriate one or ones of disk drive units 307. Hardware protocol controllers 315 in each of the device controllers 301 and 302, and corresponding hardware protocol controllers in each of RAID memories 304 and 305 (e.g., protocol controllers 403 and 404 shown in FIG. 4), handle the transfer of data between device controllers and RAID controllers. When one of the disk drive units in active set 308 fails, RAID controller 310 switches the failed unit out of the data path, recreates the failed drive unit's data and thenceforth reroutes that disk drive unit's input data to one of the disk drive units in backup set 309. Controller 310 utilizes the error correcting capability provided by the codes written onto check disks to reconstruct the data of the failed disk drive unit onto the backup unit with which the failed disk drive unit has been replaced.

The particular method by which data on a drive in a RAID architecture is reconstructed is implementation specific. In the preferred embodiment a Reed Solomon coding algorithm is used to calculate the check data that is stored on the check drives. In a particularly preferred embodiment this check data is distributed across several physical disk drives in a striped manner like that of the previously described RAID level 5 architecture. A stripe comprises corresponding sectors across a set of disk drives, some of which sectors contain mass storage data and others of which sectors contain check data for the mass storage data sectors within the stripe. A stripe may be one or more sectors deep. Such stripes on a set of disks are grouped into one or more of what is hereafter referred to as redundancy groups. In this arrangement the physical devices comprising the check drives for a particular stripe varies from stripe to stripe. The widths of the stripes (i.e., the number of physical storage devices spanned by each stripe) are equal within a redundancy group.

The particular Reed Solomon coding algorithm used determines (or limits) the number of data blocks that can be reconstructed. For example, the Reed Solomon code may limit reconstruction to two drives out of the total number of drives in a stripe (including drives holding check data). If in this case more than two drives in the stripe are determined to be inconsistent, the Reed Solomon code is incapable of reconstructing any of the data. As used herein, the term "inconsistent" means that the data, when encoded using the Reed Solomon code with the other data in the stripe, does not result in the check data stored for the stripe. For this reason, it may be desired to divide a multiple device mass storage system having a RAID architecture into a plurality of separate redundancy groups. Such an implementation is described in greater detail in co-pending, commonly assigned patent application Ser. No. 07/488,749, filed concurrently herewith in the names of David H. Jaffe, David T. Powers, Kumar Gajjar, Joseph S. Glider and Thomas E. Idleman, and entitled "DATA STORAGE APPARATUS AND METHOD", which is hereby incorporated by reference in its entirety.

Assuming that the number of drives to be reconstructed is within the limit imposed by the Reed Solomon code used, reconstruction is generally accomplished as follows. First, all data blocks across the redundancy group stripe that includes the drive(s) to be reconstructed are read. Also read is the check data corresponding to that stripe. Error correction circuitry (e.g., redundancy group error correction circuitry 408 of FIG. 4) then uses the check data and the valid data blocks to regenerate the data that should have been written to each data block or blocks that is inconsistent with the remainder of the stripe. The error correction circuitry can be of any suitable type for manipulating the data in accordance with the algorithm of the particular Reed Solomon code used. How this circuitry generates the check data and how it regenerates inconsistent data are not within the scope of the present invention; it is intended that the present invention be applicable to any system in which it is desired to be able to detect and correct data errors resulting from a failure to write one or more data blocks involved in a write operation, regardless of the particular reconstruction technique used.

Thus, by replacing a single, conventional physical storage unit with a set of disk drives operating together as a larger unit, an additional level of data path branching and switching is introduced that may incorrectly direct data to an incorrect disk drive unit.

2. Detection Of Misrouted Data

The RAID memory can be provided with means for detecting incorrectly routed data. This is preferably accomplished as follows. When data is stored in one of the disk drive units, extra fields (e.g., fields 501 and 502 of FIG. 5) are included in each block of stored data. These extra fields contain data that identifies where that data should be located in RAID memory. The extra field 501 specifies the logical unit number of the device to which the CPU associated with main memory 101 directed the data and field 502 specifies the logical block number of the data block to which the CPU directed the data.

A brief discussion is appropriate here concerning logical units, logical unit numbers, logical blocks and logical block numbers. A logical unit number (LUN) is the number assigned by a CPU to an external mass storage address space, which may be mapped to one physical mass storage device, a plurality of physical mass storage devices, or any portion of one or more such devices. The LUN is transmitted by the CPU in a data access command to identify the external device as the one to take part in the data access. In response to the logical unit number, various switches within a data path from the CPU to the selected external device are set to direct the data to or from the device. Known RAID
device sets are conventionally operated such that the CPU sees the RAID memory as one logical disk drive device. A preferred method for configuring data on a set of physical storage devices to operate the set as more than one logical storage device is described in the aforementioned co-pending patent application entitled "DATA STORAGE APPARATUS AND METHOD."

In accordance with this preferred method of configuring data, blocks of data (sector sized) from a single write operation from the CPU are written across several physical disk drives although, as far as the CPU is concerned, it has written data to a single "logical unit," typically in one sector increments. Such a logical unit comprises one or more data groups. As described in greater detail in the above-referenced patent application entitled "DATA STORAGE APPARATUS AND METHOD," each data group is a logically contiguous group of data blocks (i.e., sectors) bound by a single redundancy group. Data groups can be configured as desired to provide within the RAID memory 304 different logical units having various performance characteristics. FIG. 4 shows a particular exemplary configuration of RAID memory 304 in which several disk drive units 307 have been grouped into separate logical units 401 and 402. Each logical unit may separately include its own check data or alternatively, the two logical units may be incorporated into a larger redundancy group, for example, one formed across all disk units 307 in active set 308.

The memory of each physical disk drive device is divided into physical blocks of memory, each of which is identified internally in the device by a physical block number (PBN). A logical block number (LBN) or logical block address (LBA) is the number transmitted by a CPU to a data storage device to access a block of data identified by this number. In a physical disk drive unit, some of the physical blocks may be bad and other physical blocks may be needed for overhead operations and are therefore not available to accept user data. A unique LBN or LBA is assigned to each physical block of a logical unit that is available for user data.

Referring now to FIGS. 3 and 4, the detection of incorrectly routed data is illustrated for the case of data passing through device controller 302 to and/or from RAID memory 304. Device controller 302 includes a processor 314 that interprets CPU commands, identifies the appropriate logical unit number and the logical block number with which a command is concerned and transmits this information to RAID memory 304. When data is written to a logical unit (such as logical unit 401 or 402 in FIG. 4) within RAID memory 304, the logical unit number and logical block number are prepended to the data block received from the CPU while the data is being held in a packet staging memory 313 within device controller 302. Subsequently, in one of the SCSI (Small Computer System Interface) interfaces 410 within multiple drive SCSI interface 409 of RAID 304, the data is routed to the appropriate disk drive units within RAID memory 304. However, before transferring the data block to a particular disk 307, the logical unit number and logical block number prepended to the data are checked against expected values previously transmitted to RAID memory 304 by processor 314 of device controller 302. This check takes place while the data block is passing through multiple drive SCSI interface 409. If the expected and received values do not agree, the transfer of the data block is retried and, if a discrepancy still exists, then an unrecoverable error is reported to the CPU.

When data is read from one of the disk drives 307, the logical unit number and logical block number stored with the data are compared against the expected values identified from the CPU read command by processor 314. Such comparison is made both as the data passes through drive SCSI interface 410 and as it passes through packet staging memory 313 in device controller 302 on its way to the CPU. If a discrepancy is detected, the data transfer is terminated and the read operation is retried. If a discrepancy still exists, then the data block is either regenerated using the disk array (e.g., using redundancy data on check disks) or an unrecoverable error is reported to the CPU. In addition, a further recovery operation takes place as follows. The LBN and LUN read from the data block, which were found to be incorrect, point to another data block within RAID memory 306. This data block is marked as corrupted, along with the stripe in which it resides. Subsequent CPU attempts to read or write this stripe will be rejected until the stripe is reinitialized by the CPU or other means.

3. Detection Of Failure To Write

Another extra field (505 of FIG. 5) is included in each block of stored data to enable the RAID controller 310 to detect failures to write due to a drive failure. This extra field contains data that identifies a write operation uniquely. In a preferred embodiment, this field specifies the time at which the write operation is started by RAID controller 310, and is referred to herein as a time stamp. As described in Section 6 herein, the time stamp field can also be used to reconstruct data if a power failure interrupts execution of a CPU write request (e.g., a power failure affecting RAID controller 310).

Before any write operations are started on any disks, a time value is read from a real time clock 414 of FIG. 4 and is stored in register 412 in the drive SCSI interfaces 410 associated with the write request. The write operations are then started and the time stamp that was written into the drive SCSI interfaces 410 is appended to each data block associated with the write request (including blocks of check data), thereby storing the CPU data, the associated prepended data and the associated appended data into RAID memory.

In response to each read request from a CPU, for all data blocks in each data group that are read to satisfy that read request, the time stamps stored with the data are compared against each other by the following procedure. In each drive SCSI interface 410 of multiple drive SCSI interface 409, the time stamp from the data block is loaded into a register 412 dedicated to holding such time stamps and all such time stamp registers within multiple drive SCSI interface 409 that are associated with the read request are compared using compare circuitry within multiple drive SCSI interface 409. All of the time stamps are expected to be equal. If a discrepancy is detected, then the read request is retried. If the discrepancy is again detected and the number of disks containing an older time stamp is within the limit that can be reconstructed using the check disk(s), then the sectors in the devices holding older data are reconstructed to bring the data up to date with the most recent (i.e., newest) time stamp on the data blocks involved in the read request. If the number of disks containing an older time stamp is not within the limit that can be reconstructed using the check disk(s), then a
nonrecoverable error is reported to the CPU so that corrective action can be taken such as calling for backup tapes to reconstruct the data. In addition, the stripe must be declared as corrupted and subsequent data accesses to it must be rejected until the CPU or other means reinitializes the stripe.

FIG. 5 shows a preferred arrangement for a sector-sized block of data as it is stored on a disk drive in accordance with the principles of the present invention. As indicated in FIG. 5, each data block 500 stored in a disk drive preferably has several error checking fields in addition to the CPU data 503. The first error checking fields 501 and 502 are error checking fields prepended by the device controller 302 during a CPU write request and stripped by device controller 302 during a CPU read request. In this embodiment, these error checking fields contain the logical unit number 501 and the logical block number 502 for the associated CPU data 503 contained in that data block. Inclusion of these fields allows the disk storage system to detect misdirected data blocks as previously described.

The third field is the CPU data block 503 as sent from or to CPU bus or channel 319. The fourth field is a CRC code 504 appended by device controller 302 on transmission to RAID controller 310 and checked by RAID controller 310. CRC code 504 is checked again and stripped by device controller 302 on receipt from RAID controller 310. Inclusion of this field 504 allows the disk storage system to detect random data errors occurring on the bus between the device controller and the RAID controller.

The fifth field is a time stamp 505 appended by RAID controller 310 on a write operation and checked and stripped by RAID controller 310 on a read operation. Inclusion of this field allows the disk storage system to detect the failure to write and/or retrieve the correct sector due to disk drive failures and/or power failures.

The sixth field is a CRC code 506 appended by the RAID controller on a write operation and checked and stripped by the RAID controller on a read operation. As previously described, inclusion of this field allows the disk storage system to detect random bit errors occurring within the data block covering the additional device controller CRC 504 and time stamp 505 fields.

[...] which the block or blocks are located will be accessed. For a write operation, one or more disk drives containing check data are accessed in addition to the drive or drives on which the block or blocks of data are located. Assuming, however, that only a single drive is involved in the read operation, a comparison check of the time stamp associated with the requested data block or blocks can not be accomplished in the manner previously described to validate the data because no other drives are accessed in the read.

FIGS. 6 and 7 illustrate an embodiment of the time stamp aspect of the present invention particularly preferred for transaction processing applications. FIG. 6 shows an array 600 of physical storage devices 601-606. Devices 601-604 store blocks of transaction data. Devices 605 and 606 operate as check drives for the array and are used to regenerate data if one or two of devices 601-604 fails. It is to be understood also that if one or both of devices 605 and 606 fail, the check data stored on these drives can be reconstructed from the data on devices 601-604. Within array 600 are defined two data groups 615 and 616. Each data group may comprise a separate logical unit (e.g., logical unit 401 of FIG. 4), or together they may be included within a larger logical unit (e.g., logical unit 402 of FIG. 4). Data group 615 includes devices 601 and 602, and data group 616 includes devices 603 and 604. Data is transferred between each of devices 601-606 and a system bus 608 (e.g., bus 406 of FIG. 4) via a corresponding one of buffer memories 609-614 (e.g., buffers 407 of FIG. 4). When array 600 is operated in transaction mode, such that a write or read request may concern only a single block of data, all accesses to data on any of devices 601-604 causes both devices of the data group including the device on which the data block is located to be accessed. This applies to both write and read requests. Thus, for example, if a block of data is to be written to only device 601, both device 601 and device 602 will be accessed together in the same write command issued to data group 615. The new host data block will be written to device 601 with an appended time stamp of the type previously described. Although no new host data is written to device 602, the same time stamp written to device 601 is written to the block location on device 602 corresponding to the block location on device 601 in which the new host data is written and to 605 and 606 on the corre-
during transmission between the disk and* the RAID spending check data blocks. On a subsequent read re- 
controller, quest concerning the data block on device 601, the time 
The seventh field contains the results of an error stamps on devices 601 and 602 are compared. This corn- 
correction code (ECC) calculation 507 appended by the 50 parison of time stamps is made to ensure that new data 
disk drive on a write operation and checked and was written to device 601 when the write command to 
stripped by the disk drive on a read operation. Inclusion data group 615 was issued. 

of this field allows the disk storage system to detect and A write command to a data group is typically accom- 
possibly correct random bit errors occurring in the plished by a read-modify-write operation for purposes 
serial channel from the disk drive to disk platter and 55 of updating the check data on devices 605 and 606. This 
other media errors. operation involves first reading the old data in the block 
Additional fields may be provided for purposes of to be written, as well as the old data in the corres pond- 
performing other data handling functions. For example, ing block of the other device in the data group and the 
the disk drive may append a track identification number check data associated with those blocks. For example, 
and a sector identification number to the stored data for 60 assuming again that new data is to be written to a block 
internal drive operations. location in device 601, the old data in the block location 
_ ' . " , is read into buffer 609. At the same time, the old data in 
5. Time Stamping In Transaction Mode a corresponding b j 0 ck location in device 602 (which is 

A RAID memory may be operated in a transaction not to be changed) is read into buffer 610. Also, the old 

processing mode where data accessed by a CPU write 65 check data on devices 605 and 606 is read into buffers 

or read request comprises a single block or number of 613 and 614. Then, the data in buffer 609 is updated, as 

blocks of data (e.g., sectors) on a logical unit disk. For is the check data in buffers 613 and 614. The contents of 

a read operation, only the particular drive or drives on buffers 609, 610, 613 and 614 are then written respec- 
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lively to devices €01, 602, 605 and 606. During this write operations were completed; or (3) all of the write 

write operation, a time stamp is appended to the data operations were completed. 

transferred to devices 601 and 602, as well as to the A fourth possibility, for the following reasons, is so 

check data transferred to devices 605 and 606. remote as not to be of significant concern. This possibil- 

Although the array 600 of FIG. 6 is arranged such 5 ity is that a write operation on a disk is discontinued 

that check data for the entire array is located on devices part way through writing a data block onto a disk plat- 

605 and 606, it is to be understood that the data group ter. When power fails, there is sufficient energy stored 

configuration can be used as well in arrays in which the to allow the disks to continue writing for multiple milli- 

check data is distributed throughout the devices of the seconds, which is more than enough time to complete 

array, as in RAID level 5 or in any of the preferred data *° any operations that had progressed to the point that 

structures described in the previously-referenced co- data was actually being transferred to the disk platters, 

pending patent application entitled "DATA STOR- It is much more likely that, during a power failure, some 

AGE APPARATUS AND METHOD". disks were in the process of seeking the heads or waiting 

In addition, although data groups 615 and 616 are for the correct sector to come under the heads. In these 
shown as each comprising two physical devices, such 15 cases, there may not have been sufficient. time to corn- 
data groups may comprise any plurality of physical plete the operation in the event of a power failure, 
devices, or portions of any plurality of physical devices, Therefore, before any write operation is started on 
and may as well be used for applications other than any disk, within a nonvolatile memory 413 is stored a 
transaction processing, as set forth for example in the journal of information concerning the CPU write re- 
previously-referenced co-pending patent application 20 quest and the write operations to be performed. The 
entitled "DATA STORAGE APPARATUS AND data stored within nonvolatile memory 413 is intended 
METHOD." to assist in recovering from a write request interrupted 

FIG. 7 illustrates an example of how a series of data by a power failure. Nonvolatile memory 413 is prefera- 
blocks each of sector size can be written to and read bly battery backed-up random access memory or elec- 
from devices 601 and 602 configured as a single data trically erasable programmable read-only memory, 
group having logically contiguous sectors numbered Nonvolatile memory is used so that this information is 
1-6. For purposes of illustration, assume that sectors 1 not lost if a power failure occurs, thereby enabling such 
and 2 are a pair of corresponding sectors of devices 601 data to be utilized in recovering from such power fail- 
and 602 respectively. Likewise, assume sectors 3 and 4, ^ ure. Successful recovery from such an incomplete write 
and sectors 5 and 6 are corresponding pairs of sectors in operation means that all data blocks across the redun- 
devices 601 and 602, respectively. New data may be dancy group stripe that was modified by the write oper- 
written to an individual sector of either device 601 or ations associated with the CPU write request are consis- 
602, or new data may be written to corresponding sec- tent with the check data for that stripe, 
tors of devices 601 and 602 in parallel, but in either case 35 Some, and preferably all, of the following informa- 
both devices 601 and 602 are accessed for each transfer. tion is loaded into nonvolatile memory 413 before the 
For example, when writing new data to either sector 1 start of any write operation: (I) a write process flag— an 
of device 601 or sector 2 of device 602 or to both, a read indicator that a write operation was underway when 
operation is performed first in which the old data in power was removed; (2) an operation sequence num- 
sectors 1 and 2 is read into buffers 609 and 610 respec- ber — a number assigned to the write command when 
tively. The data in one or both buffers is modified ap- received from the CPU indicating the order of com- 
propriateiy with the new data, and the data in the buff- mand reception; (3) a physical drive address; (4) a start- 
ers are written back to the devices 601 and 602. As ing logical block number, (5) an ending logical block 
indicated by box 700, a time stamp is appended to both number, or an indication of the size of the write opera- 
sectors 1 and 2 as the data is transferred along paths A 45 tion; (6) a time stamp; (7) and the physical addresses of 
and B to devices 601 and 602 respectively. When read- check drive(s) associated with the transfer, 
ing from either sector 1 or 2 or both, the data from both After all write operations occurring on drives within 
sectors is transferred to buffers 609 and 610, and the a logical unit (e.g., logical unit 402) associated with a 
corresponding time stamps stored with sectors 1 and 2 write request are completed, the time stamp and other 
are compared during the transfer as indicated by box 50 information associated with that write request are 
702. As an example, the functions of appending and erased from the nonvolatile memory 413 by processor 
comparing time stamps may be accomplished in a drive 411. 

interface circuit such as SCSI drive interface circuit 410 If a power failure occurs affecting RAID controller 

of FIG. 4. If a discrepancy between the stamps is de- 310, processor 411 analyzes the "consistency" of each 

tected, indicating that a previous write to the devices 55 redundancy group as part of its initialization procedure 

601 and 602 was not successfully completed, the read when power is restored to the RAID controller. To do 

operation is retried. If the discrepancy reoccurs, then so, it scans each write in progress journal stored within 

either reconstruction is attempted or an error report is nonvolatile memory 413. If all journals have been 

generated as previously described. erased within nonvolatile memory 413, then processor 

6. Power Failure Interrupted Write Operation 60 *" J™£ *" X n ° f WritC T^^trT ^1 1 C ° m " 

r *^ pleted at the time of power failure. If the journal has not 

A power failure could occur at any time during exe- been erased within nonvolatile memory 413, then pro- 

curion of a CPU write request causing an interruption of cessor 411 determines which disks 307 and which sec- 

write operations associated with the write request. If tors on these disks were being written in response to the 

such a power failure does occur (for example, the 65 write request by reading the contents of the journal 

RAID controller loses power), then the write request stored in nonvolatile memory 413. Processor 411 then 

can end in any one of the following three states: (1) none causes data blocks from those sectors to be read from 

of the write operations were completed; (2) some of the disks 307 to the RAID buffers 407 and then compares 
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the time stamps from each data block with the expected 
value as read, from nonvolatile memory 413. 
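
The journal contents and the power-restore scan described above can be sketched as follows. This is only an illustration under stated assumptions, not the circuitry of the embodiment: the names `JournalEntry` and `blocks_to_check`, the field types, and the use of a dictionary to stand in for nonvolatile memory 413 are all hypothetical. The write process flag is modeled by the mere presence of an entry.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class JournalEntry:
    """One write-in-progress record (hypothetical layout of the listed items)."""
    sequence_number: int               # order in which the CPU command arrived
    drive_addresses: Tuple[int, ...]   # physical drives being written
    start_lbn: int                     # starting logical block number
    block_count: int                   # size of the write, in blocks
    time_stamp: int                    # value appended to every block written
    check_drives: Tuple[int, ...]      # physical addresses of check drive(s)


def blocks_to_check(nvram: Dict[int, JournalEntry]) -> List[Tuple[int, int, int]]:
    """Return (drive, lbn, expected_time_stamp) for every journaled block.

    An empty journal means no write was in progress when power was lost,
    so the returned list is empty and no disk needs to be read.
    """
    targets = []
    for entry in sorted(nvram.values(), key=lambda e: e.sequence_number):
        for drive in entry.drive_addresses + entry.check_drives:
            for lbn in range(entry.start_lbn, entry.start_lbn + entry.block_count):
                targets.append((drive, lbn, entry.time_stamp))
    return targets
```

Each returned triple names one block whose stored time stamp would be read back and compared with the journaled value.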

If none or all of the data blocks associated with the write request
were written with new data (i.e., either none or all of the time
stamps have the same value as in nonvolatile memory 413), processor
411 deletes the nonvolatile memory entry for the write request,
thereby indicating that the recovery operation was successfully
completed. If some of the data blocks associated with the write
request were written and some were not, then processor 411 determines
whether it is within the error correcting capabilities of the RAID
controller, using redundancy group error correction circuitry 408, to
reconstruct the data blocks that have the oldest time stamp to bring
them up to date with the newest data blocks (i.e., the data blocks
that were successfully written before the power failure interrupted
the write operation). When possible, processor 411 carries out
procedures to regenerate data where the old data resides and then
deletes the nonvolatile memory entry for the write request.

If processor 411 determines that the blocks with old data cannot be
reconstructed and it is within the error correcting capabilities of
correction circuitry 408 to reconstruct the data blocks that have the
new time stamp (thereby bringing the data blocks back to the state
just prior to the write operation), then processor 411 carries out
procedures to do that, and deletes the nonvolatile memory entry for
the write request.

If none of the above scenarios is possible, processor 411 signals an
unrecoverable error to all device controllers 301-302 to which RAID
memory 304 is connected. In turn, all device controllers 301-302 thus
signalled will report this unrecoverable error to all CPUs to which
they are connected. In addition, any further data requests to the
corrupted area are rejected until the problem is corrected.
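
The decision sequence above can be summarized in a short sketch. It is a simplification under stated assumptions: the error correcting capability of circuitry 408 is modeled as a plain count of blocks per stripe that can be regenerated (`max_correctable`), and the names `Recovery` and `plan_recovery` are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Recovery(Enum):
    NOTHING_TO_DO = "delete journal entry"
    ROLL_FORWARD = "regenerate stale blocks, then delete journal entry"
    ROLL_BACK = "regenerate pre-write state, then delete journal entry"
    UNRECOVERABLE = "report error and reject requests to the area"


def plan_recovery(block_stamps: Sequence[int], journal_stamp: int,
                  max_correctable: int) -> Recovery:
    """Choose a recovery action for one interrupted write request."""
    # Blocks whose stored stamp equals the journaled stamp were written.
    written = sum(1 for stamp in block_stamps if stamp == journal_stamp)
    stale = len(block_stamps) - written
    if written == 0 or stale == 0:
        return Recovery.NOTHING_TO_DO      # none or all blocks were written
    if stale <= max_correctable:
        return Recovery.ROLL_FORWARD       # bring old blocks up to date
    if written <= max_correctable:
        return Recovery.ROLL_BACK          # restore the state before the write
    return Recovery.UNRECOVERABLE
```

With two check drives one might call `plan_recovery(stamps, journal_stamp, 2)`; the actual limit depends on the redundancy code in use, which this sketch does not model.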

Although an embodiment has been described in which data is stored on
nonvolatile memory 413 at the beginning of every write operation, the
RAID memory may include a power supply having a power failure early
warning system that can eliminate the need to store data in
nonvolatile memory 413 at the beginning of every write operation. Such
early warning systems are provided as an option in many conventional
power supplies. These early warning systems are capable of detecting
the onset of a power failure in advance of the actual failure, and can
be used to generate an interrupt to notify a processor that a power
failure is imminent. By so using a conventional power failure early
warning system to generate an interrupt signal to processor 411,
processor 411 is provided sufficient warning to allow it to store data
concerning a pending write operation in nonvolatile memory 413 before
power actually fails. Thus, in such a case there is no need to store
data in nonvolatile memory 413 at the beginning of every write
operation, since that same data can be stored in the nonvolatile
memory in the event of a power failure.
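
A minimal sketch of this alternative follows, assuming the early warning arrives as a callback. The class `JournalOnWarning` and its method names are hypothetical, and two dictionaries stand in for the processor's working memory and for nonvolatile memory 413.

```python
class JournalOnWarning:
    """Defer journaling until a power failure warning is raised (sketch)."""

    def __init__(self):
        self.pending = {}   # volatile: write requests currently executing
        self.nvram = {}     # stands in for nonvolatile memory 413

    def begin_write(self, seq, info):
        # Normal path: nothing is stored in nonvolatile memory.
        self.pending[seq] = info

    def complete_write(self, seq):
        self.pending.pop(seq, None)

    def on_power_fail_warning(self):
        # Interrupt handler: persist only the writes still in flight.
        self.nvram.update(self.pending)
```

The trade-off is that the normal write path avoids a nonvolatile store, at the cost of relying on the warning arriving early enough for the handler to finish.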

The execution of a CPU write request and a CPU read request by RAID
memory 304 is described hereafter to further illustrate how the
various aspects of the present invention can be integrated in the
operation of RAID memory 304.

7. CPU Write Request

In a CPU write request, device controller 302 receives a request to
write a certain amount of data to a certain logical unit number,
starting at a certain logical block number. The request is staged in
packet staging memory 313 and is read and interpreted by processor
314. A request is forwarded to RAID controller 310 through protocol
controller 315 and bus 312 and is read and stored by protocol
controller 404. Protocol controller 404 signals processor 411 via a
bus 405 that a request is ready to be processed and processor 411 then
reads and interprets the write request. Note that protocol controller
403 handles requests to RAID 304 from device controller 301.

Processor 411 determines whether the write request to the logical unit
number translates to write operations on disks contained within the
array 306 of disks (e.g., logical unit 401 or 402), and, if it does,
then sends commands to those disks through each of their associated
drive SCSI interfaces 410. Processor 411 signals processor 314 in
device controller 302 to start sending data to buffers 407 of RAID
memory 304. Processor 411 also reads the current time of day from
clock 414 and loads the nonvolatile memory 413 with information
relating to the write operations that are about to start. Processor
411 also writes the time of day into a register 412 in each drive SCSI
interface 410 associated with a disk drive unit 307 that will be
involved in the write request. Processor 411 also writes registers 412
in these same drive SCSI interfaces with the expected logical unit
number and logical block number for the block of data arriving from
the CPU.

Processor 314 signals the CPU to send data to packet staging memory
313 in device controller 302. In response, the CPU sends data block
packets which are staged in packet staging memory 313. From the header
information attached to the CPU data, processor 314 determines for
which logical unit number and logical block number each packet is
intended and prepends that information to the data block. A set of
data blocks is sent to RAID controller 310 where it is temporarily
stored in the buffers 407 corresponding to the disks for which each
data block is intended. This data is transmitted from protocol
controller 404 to these buffer memories over bus 406. The data blocks
are then transferred to the corresponding drive SCSI interfaces 410
where the logical unit number and logical block number are compared
against the expected values previously loaded into registers 412 at
interfaces 410. If the values match, then each of these drive SCSI
interfaces transfers its data block to its associated disk 307 and
appends the time of day from its register 412 onto the data block.
After all disk memory write operations for this write request have
been completed, processor 411 erases the time stamp and other data in
the nonvolatile memory 413 associated with this write request. If the
logical block number or the logical unit number prepended to the data
does not match the logical unit number and logical block number stored
in the register 412 for that drive SCSI interface 410, then the
operation is retried or an unrecoverable error is reported to the CPU.
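
The ordering of the write path above (journal first, check the prepended address, append the time stamp, erase the journal last) can be sketched as follows. All names are hypothetical: `DriveInterface` stands in for one drive SCSI interface 410 with its register 412, dictionaries stand in for nonvolatile memory 413 and for the disk, and the CRC and ECC fields are omitted. What happens to the journal entry after a mismatch is not stated in the text, so leaving it in place here is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DriveInterface:
    """Stand-in for one drive SCSI interface 410 and its register 412."""
    expected_lun: int = -1
    expected_lbn: int = -1
    time_of_day: int = 0

    def load_registers(self, lun: int, lbn: int, now: int) -> None:
        self.expected_lun, self.expected_lbn, self.time_of_day = lun, lbn, now

    def accept(self, lun: int, lbn: int, cpu_data: bytes) -> dict:
        # Compare the prepended LUN/LBN with the expected register values.
        if (lun, lbn) != (self.expected_lun, self.expected_lbn):
            raise ValueError("misdirected block: LUN/LBN mismatch")
        # Append the time stamp on the way to the disk.
        return {"lun": lun, "lbn": lbn, "data": cpu_data,
                "time_stamp": self.time_of_day}


def write_block(nvram: Dict[int, dict], disk: Dict[Tuple[int, int], dict],
                iface: DriveInterface, seq: int, expected: Tuple[int, int],
                block: Tuple[int, int, bytes], now: int) -> None:
    """Write one block; `block` is (lun, lbn, data) as it arrives."""
    lun, lbn = expected
    nvram[seq] = {"lun": lun, "lbn": lbn, "time_stamp": now}  # journal first
    iface.load_registers(lun, lbn, now)
    disk[(lun, lbn)] = iface.accept(*block)
    del nvram[seq]   # erased only after the disk write has completed
```

If `accept` raises, the journal entry is still present, which is the condition a later power-restore scan would detect.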

8. CPU Read Request 

In a CPU read request, device controller 302 receives a request to
read a specified amount of data from a specified logical unit number,
starting at a specified logical block number. The request is staged in
the packet staging memory 313 and is read and interpreted by processor
314. A request is forwarded over bus 312 to the RAID controller 310
via protocol controller 315 and is read and stored by protocol
controller 404. Protocol controller 404 signals processor 411 that a
request is ready to be processed and the processor reads and
interprets the read request.

Processor 411 determines that the read request to the logical unit
number translates to read operations on disks contained within set 306
and sends commands to those disks through each of their associated
drive SCSI interfaces within multiple drive SCSI interface 409.
Processor 411 also loads register 412 in each of these drive SCSI
interfaces 410 with the expected logical unit number and logical block
number for the block of data arriving from the associated disk.

Data starts arriving into multiple drive SCSI interface 409 from those
disk drive units within the indicated logical unit. At each drive SCSI
interface 410 within this logical unit, the logical block number and
logical unit number for each block of data are checked against the
values previously loaded into registers 412 by processor 411. The time
of day appended at the end of each data block is compared by multiple
drive SCSI interface 409 with all of the others associated with the
same read request and the same stripe. If the time stamps of all
accessed data blocks are equal, then the transfer of these data blocks
to their associated buffers 407 begins. The appended time stamp is
stripped from each block as it is transferred to its associated buffer
407.
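
The time stamp comparison on the read path can be illustrated with a short sketch. It assumes each block arrives as a `(payload, time_stamp)` pair, which is a hypothetical representation, and it leaves the retry and regeneration steps to the caller.

```python
from typing import List, Sequence, Tuple

Block = Tuple[bytes, int]   # (payload, appended time stamp), hypothetical


def read_stripe(blocks: Sequence[Block]) -> List[bytes]:
    """Release payloads only if all blocks of the stripe share one stamp."""
    stamps = {stamp for _, stamp in blocks}
    if len(stamps) > 1:
        # A differing stamp suggests an earlier write did not complete.
        raise IOError("time stamp mismatch within stripe")
    # Strip the stamp as each block moves to its buffer.
    return [payload for payload, _ in blocks]
```

For example, `read_stripe([(b"a", 7), (b"b", 7)])` passes the check, while a stripe holding stamps 7 and 8 raises the error that would trigger a retry.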

When all blocks have been transferred, processor 411 signals processor
314 that the data block(s) are ready to be sent to packet staging
memory 313. Protocol controllers 404 and 315 carry out the transfer of
the data block(s) from one or more of the buffers 407 to packet
staging memory 313. As each data block is transferred to packet
staging memory 313, processor 314 again checks the logical unit number
and logical block number contained in the data block against the
expected value stored in processor 314 and strips this prepended data
from the data block to send the remainder of the data block to the
CPU.

If a discrepancy occurs in any of these comparisons anywhere in the
RAID controller or device controller, the transfer of data is aborted
and the aborted read operation is retried. In the case of detection of
misdirected data, where the detection occurs at multiple SCSI drive
interface 409, a further recovery operation takes place as follows:
(1) the LUN and LBN from the failing data blocks are read from
processor 411; and (2) the data block in RAID memory 306 indicated by
this LUN and LBN is marked as corrupted along with the stripe in which
it resides. Subsequent CPU attempts to read or write this stripe will
be rejected until the stripe is reinitialized by the CPU or other
means. If the failure reoccurs and if it is within the limits of the
error correcting capabilities of the redundancy group error correction
circuitry 408, then the failing data block is regenerated using the
disk array including the check disk(s) and correction circuitry 408.
If the failure reoccurs and is not within the limits of the error
correcting algorithms (because too many data blocks have failed), then
an unrecoverable error is reported to the CPU.

Thus it is seen that the present invention provides ways for detecting
and correcting errors in a multiple device mass storage system
resulting from power failure. One skilled in the art will appreciate
that the present invention can be practiced by other than the
described embodiments, which are presented for purposes of
illustration and not of limitation, and the present invention is
limited only by the claims which follow.

We claim: 



1. A memory for a computer system having a central 
processing unit, the memory comprising: 

a plurality of physical blocks of memory for storing 
data, said physical blocks being distributed among 
a set of physical devices operable as one or more 
logical units; 

multiple storage device controller means connected 
to the set of physical devices for routing data to 
said physical blocks of memory in response to a 
write request from the central processing unit, said 
multiple storage device controller means including 
processor means for controlling the routing of data 
to said physical blocks; 

nonvolatile storage means, including a programmable 
semiconductor memory circuit coupled to said 
processor means in the multiple storage device 
controller means, for retaining programmed infor- 
mation in the absence of power to the multiple 
storage device controller means; 

first means for programming the nonvolatile storage
means with information indicating that a write
operation involving the plurality of blocks of data
is in progress and information uniquely identifying
the write operation;

second means for storing with each of the plurality of 
blocks of data information uniquely identifying the 
most recent write operation involving the block of 
data; 

means, responsive to the completion of a write opera- 
tion, for erasing from the nonvolatile storage 
means the information that indicates that the write 
operation was in progress; 

means, responsive to a power failure in the multiple 
storage device controller means which interrupts 
execution of a write request by said processor 
means, for checking whether the information in the 
nonvolatile storage means that indicates that the 
write operation was in progress has been erased; 
and 

means for checking, for each block of data involved
in a write operation in progress at the time of a
power failure in the multiple storage device controller
means, the information stored by said first
storing means uniquely identifying the write operation
with the information stored with the block in
memory by said second storing means uniquely
identifying the most recent write operation involving
the block of data to determine if there is a disparity
indicating that the block in memory was not
stored during the power failure interrupted write
operation.

2. A method of storing data in a memory for a com- 
puter system having a central processing unit, said 
memory comprising a set of physical storage devices 
operable as one or more logical units, the method com- 
prising the steps of: 

dividing said memory into a plurality of physical 
blocks of memory, said physical blocks being dis- 
tributed among the set of physical storage devices 
operable as one or more logical units; 

routing data to said physical blocks of memory in
response to a write request from the central processing
unit using a multiple storage device controller
means connected to the set of physical storage
devices;

providing a nonvolatile storage means, including a
programmable semiconductor memory circuit in
the multiple storage device controller, for retaining
programmed information in the absence of power
to the multiple storage device controller means;

programming said nonvolatile storage means with
information indicating that a write operation involving
the plurality of blocks of data is in progress
and information uniquely identifying the write
operation;

storing with each of the plurality of blocks of data
information uniquely identifying the most recent
write operation involving the block of data;

in response to the completion of a write operation, 
erasing the information in the non-volatile storage 
means that indicates that the write operation was in 
progress; 

in response to a power failure in the multiple storage
device controller means which interrupts execution
of a write request, checking whether the information
that indicates that the write operation was
in progress has been erased, and if it has not been
erased, then initiating steps to determine what portion
of information to be stored in physical blocks
of memory for that write operation was not stored;
and

checking for each block of data involved in the write 
operation the information stored in the nonvolatile 
storage means uniquely identifying the write opera- 
tion with the information stored with the block in 
memory uniquely identifying the most recent write 
operation involving the block of data, and if there 
is a disparity, then concluding that the block in 
memory was not stored during the power failure 
interrupted write operation. 